DOCUMENT RESUME

ED 414 336                                          TM 027 869

AUTHOR        Crehan, Kevin D.
TITLE         A Discussion of Analytic Scoring for Writing Performance
              Assessments.
PUB DATE      1997-10-00
NOTE          10p.; Paper presented at the Annual Meeting of the Arizona
              Educational Research Association (Phoenix, AZ, October
              1997).
PUB TYPE      Reports - Evaluative (142) -- Speeches/Meeting Papers (150)
EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   Evaluation Methods; Feedback; *Generalizability Theory;
              *Interrater Reliability; *Performance Based Assessment;
              Scoring; *Test Reliability; Writing (Composition); *Writing
              Tests
IDENTIFIERS   *Analytic Scoring; Scoring Rubrics



ABSTRACT 



Writing fits well within the realm of outcomes suitable for 
observation by performance assessments. Studies of the reliability of 
performance assessments have suggested that interrater reliability can be 
consistently high. Scoring consistency, however, is only one aspect of 
quality in decisions based on assessment results. Another is 
generalizability. Research suggests that if the number of ratings per task 
could be increased, it may yield an increase in "task" generalizability 
without a dramatic increase in the actual number of tasks. Multitrait 
analytic scoring strategies for writing performance assessments may increase 
"task" generalizability over a single holistic score. Research undertaken by 
G. Roid (1994) supports the potential usefulness of analytic scores as 
effective sources for feedback to students and as bases for meaningful 
discussion on the writing process. Work at the Center for the Study of 
Evaluation at the University of California, Los Angeles, has expanded on the 
development of methodology and uses of analytic scoring. Work on 
narrative-writing-specific scoring rubrics has shown promising evidence of 
reliability and validity. Training in and use of these rubrics has also 
increased participating teachers' understanding of the quality components of 
writing. (Contains 3 figures and 15 references.) (SLD) 
The most prevalent response to the call for assessment reform has been to increase the 
use of more authentic assessments, e.g., performance assessments. Advocates of 
performance assessments suggest that this form of appraisal can serve to measure important 
and complex learning outcomes and provide information useful to guide improvement in 
instruction (Resnick & Resnick, 1989). Perhaps the most complex form of student 
achievement which we attempt to assess involves composition. Therefore, the task of writing 
fits well within the realm of outcomes suitable for observation by performance assessments. 

Among the problems associated with using performance assessments to measure 
important learning outcomes are objectivity of ratings and generalizability (reliability) of 
scores across raters and tasks. A review by Linn (1993) summarized evidence of acceptable 
generalizability across raters given well-defined scoring rubrics, intensive 
rater training, and monitoring during rating. Additionally, the California Assessment 
Program has established an inter-rater reliability of .90 for its writing assessment by using 
procedures which include providing sample anchor papers for each rater and recirculating 
previously scored papers to check on stability (U.S. Congress, Office of Technology 
Assessment, 1992). Shavelson, Baxter, and Pine (1992) examined the reliability and validity 
of performance assessments in the 5th and 6th grade science curriculum. They asked the 
question: How large a sample of observers is needed to produce reliable measurement? 
They found inter-rater reliability to be consistently high in evaluating student 
performance on complex tasks, high enough to conclude that a single rater provides a reliable 
score. 
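
As a rough illustration of what an inter-rater reliability figure summarizes, the sketch below computes two common indices for a pair of raters: the Pearson correlation between their scores and the proportion of papers on which they agree exactly. The score lists are invented for illustration and are not data from any study cited here; the cited studies may have used other indices.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def exact_agreement(x, y):
    """Proportion of papers on which both raters gave the same score."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical holistic scores (1-5 scale) from two raters on eight papers.
rater_a = [4, 3, 5, 2, 4, 1, 3, 5]
rater_b = [4, 3, 4, 2, 5, 1, 3, 5]

r = pearson_r(rater_a, rater_b)
agreement = exact_agreement(rater_a, rater_b)
print(f"correlation = {r:.2f}, exact agreement = {agreement:.2f}")
```

A high correlation indicates that the raters rank the papers similarly; exact agreement is a stricter index because it also penalizes one-point differences.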

While these observations offer promise for the utility of performance assessments, 
scoring consistency is only one aspect of quality in decision situations based on assessment 
results. Linn and Burton (1994) suggest that for pass-fail decisions involving individual 
students, acceptable generalizability across tasks is attained only when a large number of 
tasks are used, perhaps as many as ten. If the content area being assessed is writing, such 
a large number of writing tasks on an occasion might require an unreasonable expenditure of 
instructional time devoted to assessment, to say nothing of the administration and scoring 
costs. However, if the number of ratings per task could be increased, it may yield an 
increase in "task" generalizability without a dramatic increase in the actual number of tasks. 
Multitrait analytic scoring strategies for writing performance assessments may increase "task" 
generalizability over a single holistic score. 
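
The trade-off between number of tasks and number of ratings can be sketched with the standard relative generalizability coefficient for a fully crossed person x task x rating design. This is a minimal sketch under stated assumptions: the variance components below are hypothetical, and treating analytic ratings as a random facet crossed with tasks is a simplifying assumption of the example, not a model taken from the studies cited.

```python
def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_ratings):
    """Relative G coefficient for a crossed person x task x rating design.

    var_p      person (true score) variance
    var_pt     person-by-task interaction variance
    var_pr     person-by-rating interaction variance
    var_ptr_e  residual (three-way interaction plus error) variance
    """
    relative_error = (
        var_pt / n_tasks
        + var_pr / n_ratings
        + var_ptr_e / (n_tasks * n_ratings)
    )
    return var_p / (var_p + relative_error)


# Hypothetical variance components, chosen only for illustration.
components = dict(var_p=0.30, var_pt=0.20, var_pr=0.05, var_ptr_e=0.45)

for n_tasks, n_ratings in [(1, 1), (1, 6), (2, 6), (10, 1)]:
    g = g_coefficient(**components, n_tasks=n_tasks, n_ratings=n_ratings)
    print(f"tasks={n_tasks:2d} ratings={n_ratings} G={g:.2f}")
```

In this simple model, adding ratings shrinks the rating-related error terms but leaves the person-by-task term untouched, so the gain from more ratings on a single task levels off below the value reachable by adding tasks. How closely real analytic trait scores follow this model is an empirical question.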

Much of the research on the psychometric characteristics of writing performance 
assessments uses single-score "holistic" ratings. In writing assessment, this single holistic 
score is designed to estimate the overall quality of the writing product. There is 
agreement (e.g., Huot, 1990) that writing is a multifaceted performance and as such involves 
attainment on a number of mental traits, e.g., vocabulary, language mechanics (see Figure 
1), on which individual differences exist. Additionally, there are different types of writing, 
e.g., narrative, expository (see Figure 2). Given that writing performance involves a number 
of traits on which individuals differ, analytic scoring of writing products is recommended by 
some researchers (see Figure 3) (Roid, 1994; Huot, 1990; Marsh & Ireland, 1987; Novak, 
Herman, & Gearhart, 1996). 

Roid (1994) used cluster analyses to explore the empirical validity of the analytic 
traits presented in Figure 1. Results of these analyses demonstrated that, while forty percent 
of the responses had flat trait patterns (either all high or low), a number of distinct patterns 
among the six traits were evidenced. For example, thirteen percent of the patterns were very 
close to average on five of the traits but either high or low on conventions. Ten percent of 
the patterns showed high or low voice, with other scores near average. An additional 
thirteen percent were either high or low on ideas, organization, and voice but close to 
average on word choice, sentence fluency, and conventions. This suggests evidence of a 
creative or stylistic component among the six traits. This evidence supports the potential 
usefulness of analytic scores as effective sources for feedback to students and as bases for 
meaningful discussion on the writing process. 
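
The distinction between flat and uneven trait patterns can be illustrated with a small sketch. It is not Roid's cluster analysis; it is a hypothetical rule of thumb that labels a six-trait profile "flat" when every trait lies within a chosen tolerance of the writer's own mean, and otherwise reports which traits stand out. The trait names follow Figure 1; the tolerance value and example scores are assumptions made for illustration.

```python
TRAITS = (
    "ideas",
    "organization",
    "voice",
    "word_choice",
    "sentence_fluency",
    "conventions",
)


def describe_profile(scores, tolerance=0.5):
    """Label a trait profile as flat or uneven relative to its own mean.

    scores    dict mapping each name in TRAITS to a numeric rating
    tolerance hypothetical cutoff for calling a trait high or low
    """
    mean = sum(scores[t] for t in TRAITS) / len(TRAITS)
    high = [t for t in TRAITS if scores[t] - mean > tolerance]
    low = [t for t in TRAITS if mean - scores[t] > tolerance]
    label = "flat" if not high and not low else "uneven"
    return label, high, low


# Hypothetical profiles on a 1-5 scale.
flat_profile = dict.fromkeys(TRAITS, 4)
strong_conventions = dict.fromkeys(TRAITS, 3) | {"conventions": 5}

print(describe_profile(flat_profile))
print(describe_profile(strong_conventions))
```

A profile flagged as uneven points to the specific traits a teacher might discuss with the student, which is the kind of feedback a single holistic score cannot supply.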

Work at the Center for the Study of Evaluation, National Center for Research on 
Evaluation, Standards, and Student Testing, at UCLA (e.g., Wolf & Gearhart, 1993a, 
1993b) has expanded on the development of methodology and uses of analytic scoring. 
Work on narrative-writing-specific scoring rubrics has shown promising evidence of 
reliability and validity (Gearhart, Herman, Novak, Wolf, & Abedi, 1994; Gearhart, Herman, 
& Novak, 1996). Additionally, training in and use of these rubrics has benefited instruction by 
increasing participating teachers' understanding of the quality components of writing (Gearhart 
& Wolf, 1994; Gearhart et al., 1994; Wolf & Gearhart, 1995). 

Figure 1 

Definitions of Analytic Traits (Spandel & Stiggins, 1994) 

Ideas: The heart of the message, the content of the piece, the main theme, 
together with all the details that enrich and develop that theme. 
Ideas are strong when the message is clear and enlivened with 
interesting and important details. 

Organization: The internal structure of a piece of writing, the thread of 
central meaning, the pattern that holds everything together. 
Organization is strong when the piece begins meaningfully, 
proceeds logically, and creates a sense of anticipation that is 
ultimately systematically fulfilled. 

Voice: The writer coming through the words, his or her wit and 
feeling, the sense that a real person is speaking to us and cares 
about the message. Good writers impart a personal tone and 
flavor to the piece that is unmistakably his or hers alone. 

Word Choice: The use of rich, colorful, precise language that communicates 
not just in a functional way but in a way that moves and 
enlightens the reader. Strong word choice may depend more on 
the skill of using words precisely than on an exceptional 
vocabulary. 

Sentence Fluency: The rhythm and flow of the language, the sound of word 
patterns, the way in which the writing plays to the ear - not 
just to the eye. With good fluency, sentences vary in length 
and style, and they are so well-crafted that reading aloud is a 
pleasure. 

Conventions: The mechanical correctness of the piece - spelling, grammar, 
usage, paragraphing, capitalization, and punctuation. Writing 
that is strong in conventions has been well proofread and edited. 




Figure 2 

Modes of Writing (Roid, 1994) 

Descriptive: Describes an object, place, or person, enabling the reader 
to visualize what is being described and to feel that he or 
she is very much part of the writer's experience. Writer's 
purpose is to create a strong and vivid image or 
impression in the reader's mind. 

Persuasive: Attempts to convince the reader that a point of view is 
valid or persuade the reader to take a specific action. 
Writer's purpose is to persuade the reader. 

Expository: Gives information, explains something, clarifies a process, 
or defines a concept. Writer's purpose is to inform, 
clarify, explain, define, or instruct. 

Narrative: Recounts a personal experience or tells a story based on a 
real event. Writer's purpose is to recount an experience 
or tell a story in a concise and focused way to create some 
central theme or impression in the reader's mind. 

Imaginative: Tells a story based on the writer's imagination. The story 
is basically fictional, but the writer may use his or her 
experience and knowledge of people or situations to bring 
a special flair or flavor to the writing. Writer's purpose 
is to entertain the reader or write for the author's own 
pleasure. 

Figure 3 

Advantages and Limitations of Multifaceted Analytic Scoring 
(Gearhart & Wolf, 1994; Gearhart, Wolf, Burkey, & Whittaker, 1994; 
Spandel & Stiggins, 1990; Wolf & Gearhart, 1995) 

Advantages: 

1. Developing the analytic scoring rules forces judgements on what 
is valued in writing and the product provides an operational 
definition for the quality characteristics of writing. 

2. Allows more systematic and detailed feedback to students on the 
strengths and weaknesses of their writing. 

3. Provides more diagnostic information that teachers may use to 
guide their instruction and student practice. 

4. Benefits the teachers who are trained in the rating method and 
subsequently perform the ratings. These teachers can use what 
they learn to improve their writing instruction and feedback to 
students. 

5. Ratings on multiple facets of the domain of writing skills allow 
improved generalizability over a single holistic score. 

Limitations: 

1. Analytic scoring can be very expensive and time consuming if not 
well managed. 

2. The analytic rating task is not for everybody. The rating task is 
initially difficult and beginning raters may experience frustration. 